Project Description

Fraud in the banking sector causes more than financial losses: it can also damage a company’s reputation and weaken customer trust. In this project, we aim to detect suspicious banking transactions, highlight hidden anomalies, and help protect the integrity of financial systems.

Our goal is to identify transactions that look unusual or do not follow typical patterns. These may be early signs of fraudulent behavior. By analyzing this data, we can support the organization in taking quick action, improving security, and maintaining the confidence of its customers.

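To do this, we will use the Isolation Forest (IForest) algorithm from the PyOD library (pyod.models.iforest). The model isolates observations with random splits; points that can be separated from the rest in only a few splits are scored as outliers, which makes it well suited to finding rare and suspicious events in large datasets.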
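To make the isolation idea concrete, here is a minimal sketch in plain Python. It is a simplified, one-dimensional illustration and not PyOD's implementation: the helper names (path_length, mean_path_length) and the sample amounts are hypothetical, and a real Isolation Forest also subsamples the data and splits on randomly chosen features.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:  # only duplicates left, nothing more to split
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as `point`.
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=100, seed=0):
    """Average the isolation depth over several random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical transaction amounts: eight typical values and one extreme value.
amounts = [12.0, 14.5, 13.2, 15.1, 11.8, 14.0, 13.7, 12.9, 950.0]

typical_depth = mean_path_length(13.2, amounts)
outlier_depth = mean_path_length(950.0, amounts)
print(f"typical: {typical_depth:.2f} splits, outlier: {outlier_depth:.2f} splits")
```

The extreme amount is usually separated by the very first split, while a typical amount needs several, and that difference in path length is what the anomaly score is built on.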

Our task involves:

  • Applying the IForest model to transaction data.
  • Identifying anomalies that could indicate potential fraud.
  • Summarizing the results to provide clear insights.
  • Supporting the company in making faster and better fraud-related decisions.

By the end of this project, we aim to provide a reliable method for monitoring transactions and reducing the risk of fraud through early detection and meaningful insights.

The Data

We will work with a dataset containing information about financial transactions. Below is a summary of the key columns provided:

Column               Description
TransactionID        A unique identifier for each transaction.
TransactionAmount    The amount of money involved in the transaction (in USD).
TransactionDuration  Duration of the transaction (in seconds).
AccountBalance       The balance of the account after the transaction was processed (in USD).

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
from pyod.models.iforest import IForest
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
transactions = pd.read_csv("transactions.csv")

# Isolate key columns
columns = ["TransactionID", "TransactionAmount", "TransactionDuration", "AccountBalance"]
transactions = transactions[columns]

# Display the first rows
transactions.head()
Out[2]:
TransactionID TransactionAmount TransactionDuration AccountBalance
0 TX000001 14.09 81 5112.21
1 TX000002 376.24 141 13758.91
2 TX000003 126.29 56 1122.35
3 TX000004 184.50 25 8569.06
4 TX000005 13.45 198 7429.40

Compute an anomaly score for each transaction.

In [3]:
# Select numerical features for anomaly detection
features = transactions[["TransactionAmount", "TransactionDuration", "AccountBalance"]]

# Train an IForest model; contamination=0.05 assumes about 5% of transactions are anomalous.
# NumPy arrays are passed so that fitting and scoring receive the same input type,
# which avoids scikit-learn's "X has feature names" warning.
model = IForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(features.to_numpy())

# Add the anomaly scores to the dataset (in PyOD, higher scores are more anomalous)
transactions["Anomaly_Score"] = model.decision_function(features.to_numpy())

transactions.head()
Out[3]:
TransactionID TransactionAmount TransactionDuration AccountBalance Anomaly_Score
0 TX000001 14.09 81 5112.21 -0.127230
1 TX000002 376.24 141 13758.91 -0.043850
2 TX000003 126.29 56 1122.35 -0.150263
3 TX000004 184.50 25 8569.06 -0.105860
4 TX000005 13.45 198 7429.40 -0.096070
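
The contamination=0.05 setting is what later turns these scores into labels. As far as PyOD's documentation describes it, the cutoff is taken at the (1 - contamination) percentile of the training scores, and anything scoring above it is labeled an outlier. The sketch below imitates that rule in plain Python; the percentile helper and the 20 scores are hypothetical and only illustrate the mechanism.

```python
def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Hypothetical anomaly scores (higher means more anomalous).
scores = [
    -0.150, -0.145, -0.140, -0.132, -0.127, -0.121, -0.118, -0.110, -0.106, -0.101,
    -0.096, -0.090, -0.084, -0.077, -0.070, -0.061, -0.052, -0.044, -0.020, 0.085,
]

contamination = 0.05  # expected share of anomalies, as in the model above
threshold = percentile(scores, 100 * (1 - contamination))

# Scores above the threshold are flagged 1 (anomalous), the rest 0 (normal).
flags = [int(score > threshold) for score in scores]
print(f"threshold={threshold:.5f}, flagged={sum(flags)} of {len(scores)}")
```

Because the cutoff is a percentile of the scores themselves, the model flags roughly the requested share of transactions whether or not that many are truly fraudulent, so the contamination value deserves a deliberate choice.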

Which transactions are flagged as anomalies?

In [4]:
# Flag transactions as anomalies based on the model's prediction.
# PyOD's predict returns 1 for outliers and 0 for inliers
# (unlike scikit-learn's IsolationForest, which returns -1 for outliers).
transactions["Anomaly"] = model.predict(features.to_numpy()).astype(int)

transactions.head()
Out[4]:
TransactionID TransactionAmount TransactionDuration AccountBalance Anomaly_Score Anomaly
0 TX000001 14.09 81 5112.21 -0.127230 0
1 TX000002 376.24 141 13758.91 -0.043850 0
2 TX000003 126.29 56 1122.35 -0.150263 0
3 TX000004 184.50 25 8569.06 -0.105860 0
4 TX000005 13.45 198 7429.40 -0.096070 0

Create a summary of anomalous transactions.

In [5]:
# Summarize the anomalous transactions, keeping only the original columns
summary_columns = ["TransactionID", "TransactionAmount", "TransactionDuration", "AccountBalance"]
anomalies_summary = transactions.loc[transactions["Anomaly"] == 1, summary_columns]

anomalies_summary.head()
Out[5]:
TransactionID TransactionAmount TransactionDuration AccountBalance
41 TX000042 34.02 19 14214.48
74 TX000075 1212.51 24 605.95
85 TX000086 1340.19 30 8654.28
141 TX000142 1049.92 21 2037.85
146 TX000147 973.39 296 2042.22
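
A filtered table is easier to act on with a few aggregate figures next to it. The sketch below compares normal and anomalous transactions using only the ten rows visible in the Out[4] and Out[5] previews, so the numbers describe those previews and not the full dataset. The summarize helper is hypothetical, and plain Python is used so the example runs without transactions.csv.

```python
from statistics import mean

# Rows copied from the Out[4] and Out[5] previews:
# (TransactionID, TransactionAmount, Anomaly)
preview = [
    ("TX000001", 14.09, 0),
    ("TX000002", 376.24, 0),
    ("TX000003", 126.29, 0),
    ("TX000004", 184.50, 0),
    ("TX000005", 13.45, 0),
    ("TX000042", 34.02, 1),
    ("TX000075", 1212.51, 1),
    ("TX000086", 1340.19, 1),
    ("TX000142", 1049.92, 1),
    ("TX000147", 973.39, 1),
]


def summarize(rows):
    """Return count, mean and maximum amount for each Anomaly flag."""
    groups = {0: [], 1: []}
    for _, amount, flag in rows:
        groups[flag].append(amount)
    return {
        flag: {"count": len(amts), "mean_amount": mean(amts), "max_amount": max(amts)}
        for flag, amts in groups.items()
        if amts  # skip a group with no rows
    }


summary = summarize(preview)
for flag, stats in summary.items():
    label = "anomalous" if flag == 1 else "normal"
    print(label, stats)
```

On the full DataFrame, the same comparison could be obtained with a pandas groupby on the Anomaly column.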

What is the distribution of TransactionAmount for normal and anomalous transactions?

In [6]:
# Plot the distribution of TransactionAmount for normal and anomalous transactions
plt.figure(figsize=(8, 6))
# Anomaly is an integer flag (0 = normal, 1 = anomalous), so compare against 0 and 1
transactions[transactions["Anomaly"] == 0]["TransactionAmount"].hist(bins=30, alpha=0.5, label="Normal", color="blue")
transactions[transactions["Anomaly"] == 1]["TransactionAmount"].hist(bins=30, alpha=0.5, label="Anomalous", color="red")
plt.title("Transaction Amount Distribution")
plt.xlabel("Transaction Amount")
plt.ylabel("Frequency")
plt.legend()
plt.savefig("anomalies_histogram.png")
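[Figure: overlaid histograms of TransactionAmount for normal (blue) and anomalous (red) transactions; x-axis: Transaction Amount, y-axis: Frequency. Saved as anomalies_histogram.png.]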